Electrophoretic Analysis System Having In-Situ Calibration

ABSTRACT

An electrophoretic system having a plurality of separation lanes is provided with an automatic calibration feature in which each lane is separately calibrated. For each lane, the calibration coefficients map a spectrum of received channel intensities onto values reflective of the relative likelihood of each of a plurality of dyes being present. Individual peaks, reflective of the influence of a single dye, are isolated from among the various sets of detected light intensity spectra, and these can be used to both detect the number of dye components present, and also to establish exemplary vectors for the calibration coefficients which may then be clustered and further processed to arrive at a calibration matrix for the system. The system thus permits one to use different dye sets to tag DNA nucleotides in samples which migrate in separate lanes, and also allows for in-situ calibration with new, previously unused dye sets.

RELATED APPLICATIONS

This is a continuation of U.S. application Ser. No. 12/154,598, filedMay 23, 2008, which is a continuation of U.S. application Ser. No.11/047,355, filed Jan. 31, 2005, now U.S. Pat. No. 7,384,525, which is acontinuation of U.S. application Ser. No. 09/676,526, filed Oct. 2,2000, now U.S. Pat. No. 6,863,791, which is the nonprovisionalapplication of U.S. Provisional Patent Application No. 60/231,574, filedSep. 11, 2000; each of which is incorporated herein by reference in itsentirety.

INTRODUCTION

The present teachings are directed to electrophoresis equipment whichidentifies migrating species based on an analysis of detectedfluorescence levels. The present teachings are directed to suchequipment having an in-situ calibration capability so as to permitvarious dye sets to be used and a three dimensional graphicalrepresentation of results to allow for simplified base calling.

In the detector system in accordance with U.S. Pat. No. 6,027,627,fluoresced light from migrating species within a plurality ofcapillaries aligned in parallel passes through a filter, a transmissiongrating beam splitter and a lens before it impinges on a CCD detectorarray. In the preferred embodiment, the CCD detector array comprises1024×256 pixels. The first dimension, (1024 pixels) covers 96 parallelcapillaries, each capillary being focused onto at least one of the 1024rows, although the number of rows per capillary can be increased byselecting a lens with a different focal length or changing other opticalparameters. The second dimension (256 pixels) covers the fluorescencespectrum spread by the transmission grating.

In this prior art system, both the first order and second ordercomponents can be focused onto the detector array, although this is notan absolute requirement. What is required, however, is that a spectrum(such as represented by the 1^(st) order components) be created for eachcapillary and detected. The spectrum of interest should include thewavelengths of light at which the dyes are known to fluoresce. Thespectrum of interest for each capillary is spread over P contiguouspixels and these are divided into R channels of Q contiguous pixels,R=P/Q. R should be at least as large as the number of dyes M being usedand preferably is greater than this number.

The detector of this prior art system outputs a spectrum comprising Rlight intensity values for each capillary, each time that data isprovided to the associated processor. The processor then maps thespectrum of R intensity values for each capillary, onto values whichhelp determine which dye has been detected in that capillary. This istypically done by multiplying calibration coefficients by the vector ofintensity values, for each capillary.

The principle behind the calibration coefficients is that a spectrum ofreceived light intensities in each of the channels is caused by thespectrum of a single dye (tagging a corresponding base) weighted by theeffects (calibration coefficients) of the detection system. If I₀(n),I₁(n), . . . , I₉(n) represent the measured intensities of the R=10channels at the n^(th) set of outputs from the CCD (after preprocessingincluding detection, binning and baseline subtraction), B₀(n), B₁(n), .. . , B₃(n) is a vector representing the contribution (presence 1 orabsence 0) from the M=4 bases, and C_(ij) are coefficients of a known10×4 matrix which maps the bases onto the detected channels, we thenhave the following relationship:

$\begin{matrix}{\begin{pmatrix}{I_{0}(n)} \\{I_{1}(n)} \\{I_{2}(n)} \\\ldots \\\ldots \\{I_{9}(n)}\end{pmatrix} = {\begin{pmatrix}C_{00} & C_{01} & C_{02} & C_{03} \\C_{10} & C_{11} & C_{12} & C_{13} \\C_{20} & C_{21} & C_{22} & C_{23} \\\ldots & \ldots & \ldots & \ldots \\\ldots & \ldots & \ldots & \ldots \\C_{90} & C_{91} & C_{92} & C_{93}\end{pmatrix}\left( {{B_{0}(n)}{B_{1}(n)}{B_{2}(n)}{B_{3}(n)}} \right)}} & \left( {{Eq}.\mspace{14mu} 1} \right)\end{matrix}$

Eq. 1 can thus be rewritten as:

I(n)=CB(n)  (Eq. 2)

Given a vector of intensities output by a CCD for each separation lane,the theory of determining the presence or absence of each of the M=4bases from the R=10 wavelength channels is fairly well established. Thisis simply a particular case of an over-determined system in which asmaller number of unknowns is determined from a greater number ofequations. After mathematical transformation, Eq. 2 can be written as:

B(n)=(C ^(T) C)⁻¹ C ^(T) I(n)  (Eq.3)

where B₀(n), . . . , B₃(n) now represent the unknown values of theindividual bases as functions of time index n, each value beingreflective of the relative likelihood of the corresponding dye taggingthat base being present; I₀(n), I₁(n), . . . I₉(n) are the fluorescenceintensities of the ten channels, and C_(ij)'s are the coefficients ofwavelength i under known base j and where C^(T) is a transpose of thematrix C and A=(C^(T)C)^(−I)C^(T) is the pseudo-inverse of matrix C.While in the above analysis, C is a 10×4 matrix because a total of tenchannels and four bases are used, in the general case, C is an R×Mmatrix wherein R≧M, and R and M are both integers greater than 2.

Typically, in prior art systems, the calibration matrix C is determinedat the time the system is created. More particularly, calibration matrixC is specific to a set of dyes that are used, and is constant for allseparation lanes in a system. If such a prior art system is thenmodified, such as by upgrading to a new set of optical filters, thecalibration matrix C needs to be re-calibrated.

FIG. 1 illustrates two shortcomings of using a constant calibrationmatrix for all capillaries in a capillary array. As seen in FIG. 1, the0^(th) order spectral intensities 102 from each of the capillaries donot map onto the same pixel in corresponding pixel columns. Inparticular, the 0^(th) order spectral intensities from capillaries 7 and10, which are detected in their corresponding pixel columns 104, 106,respectively, do not fall on the same-positioned pixel as do the 0^(th)order spectral images from the remaining capillaries. Similarly, the1^(St) order spectral intensities 112 from capillaries 7 and 10 in thesesame columns also do not fall on the same-positioned pixels as do the1^(st) order spectral images from the remaining capillaries, but ratherare offset by a skew of a single pixel. A consequence of this skewness,which may be caused by improper arrangement of capillaries 7 and 10within the capillary array, is that the binning process for 1^(st) orderintensities from capillaries 7 and 10 results in a spectrum which wouldbe slightly different than if the binning process started one pixelover. As a result, using a single calibration matrix C for all thecapillaries, leads to imprecise results, when Eq. 3 is invoked to try toidentify the due causes of the detected fluorescence from capillaries 7and 10. As also illustrated in FIG. 1, the 1^(st) order spectral image114 for capillary 1 maps onto one more pixels than do the spectralimages of the remaining capillaries. This feature, which may be causedby such factors as coma, spherical aberrations, astigma, and lateralchromatic aberration of the grating image system, also may lead toimprecisions when binning followed by use of the calibration matrix asin Eq. 3.

In general, different dye sets have different spectra. As a consequence,each dye set has a different calibration matrix. Consequently, a furtherdisadvantage of using a single calibration matrix for a multi-laneseparation system is that one cannot run multiple dye sets in differentseparation lanes.

SUMMARY

In one aspect, the present teachings disclose a multi-laneelectrophoretic separation apparatus which simultaneously utilizesmultiple calibration matrices which calibrate for different dyes used totag migrating species.

In another aspect, the present teachings disclose a multi-laneelectrophoretic separation apparatus in which a calibration matrix iscalculated in-situ.

In yet another aspect, the present teachings disclose a multi-laneelectrophoretic separation apparatus in which a calibration matrix iscalculated for each lane.

In yet another aspect, the present teachings disclose a method forcalculating a calibration matrix from data acquired from a sample. Themethod comprises the steps of detecting emitted fluorescence spectrafrom a plurality of tagged migrating species, clustering the detectedpeaks into a number of groups, and then calculating calibrationcoefficients representative of at least some of the groups. Afterdetection, and prior to grouping, the peaks may be culled to ensure thatonly peaks having a high probability of being associated with aparticular group are used to calculate the calibration coefficients.

These and other features of the present teachings are set forth herein.

BRIEF DESCRIPTION OF THE DRAWINGS

The skilled artisan will understand that the drawings described beloware for illustration purposes only. The drawings are not intended tolimit the scope of the present teachings in any way.

FIG. 1 shows imaging of zero- and first-order components onto a detectorarray;

FIG. 2 shows a flow chart outlining the steps in calculating acoefficient matrix in accordance with the present teachings;

FIG. 3 presents the exemplary detected intensity over 10 channelsderived from the detector array;

FIGS. 4 a and 4 b show intermediate results from peak spacingdetermination;

FIG. 5 shows relative fluorescence of the candidate peak intensities foreach nucleotide;

FIG. 6 depicts the process for characterizing peaks which remain afterinitial culling;

FIG. 7 a represents plots for each cluster of nucleotides;

FIG. 7 b represents histograms for the isolated peaks;

FIG. 8 shows thinned peaks for seven channels of data for an example inwhich three dyes are to be calibrated;

FIG. 9 a shows coefficient plots for the three dyes used in conjunctionwith FIG. 8;

FIG. 9 b shows a histogram for the clustered peaks corresponding tothree dyes of FIG. 9 a;

FIG. 10 represents the deconvoluted data for each of the candidate dyesof FIGS. 8-9 b;

FIGS. 11 a-11 c present experimental results for identifying proteins;

FIG. 12 represents an experimental histogram of peaks from a DNAsequencing example.

FIGS. 13 a-13 b present calibration coefficient matrices for each offour dye sets commonly used in DNA sequencing;

FIGS. 14 (SEQ ID NO: 1), 15 (SEQ ID NO: 2) and 16 (SEQ ID NO: 3) showcontoured time-frequency plots for the beginning, middle and terminalportions, respectively, for an electrophoresis run of a DNA sample in asingle capillary; and

FIG. 17 shows typical morphologies seen in contoured time-frequencyplots, such as those shown in FIGS. 14-16.

DESCRIPTION OF VARIOUS EMBODIMENTS

A system on which the present teachings can be used is an automatedcapillary electrophoresis system, such as is described in U.S. Pat. No.6,027,627. A detector arrangement for such a system is shown in U.S.Pat. No. 5,998,796. The contents of both of these are incorporated byreference to the extent necessary to understand the present teachings.

The present teachings are described with reference to a detector systemin which a total of P=30 pixels are binned into R=10 wavelength channelsof Q=3 pixels each. The binning is done onboard the CCD array chip undersoftware control. For DNA sequencing, the number of dyes M is 4—one foreach nucleotide—and the spectrum of interest is in the range of 520 nmto 670 nm. Thus, the spectral resolution of the 10 wavelength channelsis about 15 nm each. During data collection, for each of the 96capillaries, 10 data points are offloaded each time the CCD array isread out and these values are stored for subsequent analysis.Furthermore, during an electrophoresis run, data from the CCD array isoffloaded periodically, at a sample rate of f samples per second. Thus,during a run which lasts time T, a total of N=fT samples are taken.

FIG. 2 presents a flowchart 200 depicting the general steps incalculating a coefficient matrix for each capillary in accordance withan embodiment of the present teachings. In step 202, R-channelfluorescence data from a single capillary is collected for apredetermined period of time. FIG. 3 presents a graphical illustrationof light intensity time series data for a total of R=10 channels. Instep 204, smoothing and baseline subtraction is performed on theoriginal data to eliminate trends. In step 206, peaks are identified inthe time domain. In step 208, peak widths and peak spacing metrics arecalculated. In step 210, the metrics are used in conjunction with theidentified peaks to eliminate peaks from consideration for forming thecoefficient matrices. In step 212, the remaining peaks are ranked sothat the strongest peaks are used. In step 214, the coefficients of thecalibration matrix are calculated. Finally, in step 216, the calibrationmatrix is used to perform spectral deconvolution to identify migratingsamples. These steps are now described in further detail.

Step 204—Data Smoothing and Baseline Subtractions. The raw data aresmoothed by Savitzky-Golay method for a few close points, e.g., 1, 3, 5,7, 9 points, as determined by a user of the present teachings. Ingeneral, the data would not be smoothed if 1 point is chosen. The baselines of the smoothed data in the ten channels are subtracted withsoftware that runs on the processor associated with the detector system.The software searches local minimum of every local section, for example,300 data points in a channel as a section. A straight line, baseline,connects the two minimums in the consecutive sections. The values of rawdata between the two local minimums are subtracted to the baselinevalue. The new values after the baseline subtraction and smoothing arestored for further processing. The order of data smoothing and baselinesubtraction can be reversed.Step 206—Peak-Picking in Time Domain. The properties of each wavelengthchannel after baseline subtraction are calculated before peak-picking.These properties include global average signal intensity, global averageintensity deviation between two consecutive points, local maximum andlocal average deviation in a predetermined number of sections,preferably 40.

$\begin{matrix}{{{global}\mspace{14mu} {average}{\mspace{11mu} \;}{intensity}\text{:}\mspace{14mu} I_{g,{ave}}} = \frac{\sum\limits_{j = m}^{0}\; I_{j}}{m}} & \left( {{Eq}.\mspace{14mu} 4} \right) \\{{{global}\mspace{14mu} {average}\mspace{14mu} {deviation}\text{:}\mspace{14mu} I} = {{g_{1}{dev}} = \frac{\sum\limits_{j = {m - 1}}^{0}\; {{I_{j + 1} - I_{j}}}}{m - 1}}} & \left( {{Eq}.\mspace{14mu} 5} \right) \\{{{local}\mspace{14mu} {maximum}\text{:}\mspace{14mu} I_{1,\max}} = {\max \left( {I_{k},I_{k + 1},I_{k + 2},\ldots \mspace{14mu},I_{k + 3}} \right)}} & \left( {{Eq}.\mspace{14mu} 6} \right) \\{{{local}\mspace{14mu} {average}\mspace{14mu} {deviation}\text{:}\mspace{11mu} I_{1,{dev}}} = \frac{\sum\limits_{j = {s - 1}}^{0}\; {{I_{k + j + 1} - I_{k + j}}}}{s - 1}} & \left( {{Eq}.\mspace{14mu} 7} \right)\end{matrix}$

where I_(j) represents the intensity at point j, m is the total numberof data points, s is the number of data points in a local section, and kis the starting point in the section.

The above four parameters for each of the ten channels, at appropriatepoints along the sampled intensity values, are used in a heuristicalgorithm for determining peaks. A point I_(j) in a given channel isconsidered to be a peak if it meets the following criteria:

(1) I_(j) is a local maximum among five consecutive points:I_(j)>I_(j−1)>I_(j−2) and I_(j)>I_(j+1)>I_(j+2);

(2) Ij is greater than 20% of the section maximum and is also greaterthan 40% of global average intensity: Ij>0.2I_(s,max);I_(j)>0.4I_(g,ave);

(3) At least one of the two edge deviations on either side of Ij must begreater than 70% of the section average deviation and greater than 20%of global average deviation: i.e., e.g.—

right edge deviation: (I_(j+1)−I_(j+3))/2>0.7I_(l,dev) and(I_(j+1)−I_(j+3))/2>0.2I_(g,dev), orleft edge deviation: (I_(j−1)−I_(j−3))/2>0.7I_(l, dev) and(I_(j−1)−I_(j−3))/2>0.2I_(g,dev), or both;

(4) Peak assembly. This is a process to remove a peak that happens inonly one channel (not physically sound) and to identify as the same peakif a peak maximum is shifted one frame due to mathematical manipulation,and then to determine band location in the time domain. Most of the peakmaximums in more than one channel happen at a specific time. At leasttwo channels have shown peaks at a specific time. Since the individualchannel has been carried out, baseline subtraction is done separately.Sometimes peak maximum may shift a frame in time domain. It is the samepeak if peak position is shifted a frame in different color channels.Peak intensities in all of the channels are summed in the time domainsshown in the figures.

FIG. 3 depicts a portion of the raw time series intensity data Xj, jrepresenting the time index of the sample, for each of the 10 channelsfrom a single capillary during a DNA sequencing run. At any giveninstant, only a few channels exhibit a peak because each of the fourdyes only has a finite bandwidth. The raw data intensity signals Xj fromeach of the 10 channels, for each capillary, are stored for futureprocessing to create the multicomponent matrix and also to identify thefluorescent species giving rise to the detected fluorescent intensities.The pick-picking process is carried out through all of the 10 traces togive the peak position in time domain. A peak due to the specific typeof molecules in the sample will show up at a specific time in more thana trace because of the overlapping spectra. For example in FIG. 3, apeak at 52 min has shown up in trace 3 to trace 9. The peak-pickingprogram will pick up the peak from trace 3 to trace 9. A peak at 51.9,just prior to the peak at 5000, has shown up from the trace 0 to trace6. At the specific time that a peak shows up in more than one channel,the peak intensities for all 10 channels are recorded for the dataprocessing step. The channel number of maximum intensity over all of theten channels is also recorded at the specific time.

Step 208—Peak Spacing Determination.

(a) Peak spacing in a local section. In the local section, peak spacingcan be considered as a constant. After all of the peaks are determinedfrom the last step shown in FIG. 4 b, average peak spacing Δt_(sp,ave)in the local section is calculated based on the all of the identifiedpeaks shown in FIGS. 4 a and 4 b. The average peak spacing is 12.5frames. A pair of peaks is retained for the coefficient calculation ifthe spacing of the consecutive spacing is bigger than 75% of the averagepeak spacing or: Δt_(sp)>0.75 Δt_(sp,ave). Thus, the peaks marked with Xin FIG. 4 b are rejected from the coefficient calculation.

(b) Identifying the overlapped peaks by peak-fitting software. Afterthese peaks are identified, peak widths can be identified withpeak-fitting software. In most electrophoresis separations, the peakscoming out at the first section of the electropherograms are usuallyvery sharp and the peaks in the late section of separation are usuallywide. However, the peak widths in a small local section, for example, in300 frames, are essentially the same. This concept is very important toresolving the temporal overlapping peaks in a local section. In DNAanalysis, the complete overlapping bands with different DNA size in timedomain are rare. Most of the overlap is confined to the rising ortailing edge of the peaks where one enters into the detection window andthe other is moving out the windows. The overlapping peaks often are 30%wider than single peaks in DNA separation. If the intensity of a peak ina channel is small, e.g. 20% of local maximum intensity, the peak widthwas not calculated due to its low intensity. The peak width and spacingat a specific moment can be calculated from the ten traces of the data.

Step 210. Peak Filtering & Spike Rejection. The width of a normal peakis usually between 4 to 20 frames. In contrast, spikes usually happen inone frame and appear as very sharp peaks. The spikes can result fromcosmic ray pickups by the camera, thermal noise due to overheating ofthe camera, and sample impurity. Spacing criteria: If the peak spacingis 75% greater than the average peak spacing, the two peaks are retainedfor the coefficient calculation. Another technique is to use both thepeak width and spacing. If the average of two widths of adjunct peaks attheir half intensity is bigger than peak spacing, the two peaks arerejected from the calculation of the matrix coefficients. There are twocases and rationale for the overlapping peaks. In one case, the twopeaks are from the same dye tagging to the DNA molecules. These are notseparated because of poor separation resolution. We found that this casewould not cause any problem in matrix calculation since it involves thesame dye. Nonetheless, we would prefer to reject these types of peaks inthe matrix calculation as a general rule of peak width. The other typeof case is where the two peaks are from different dyes tagging the DNAmolecules with size differences of 1 base pair. We found that these twodyes are usually somewhat separated in time domain, but not completelyresolved. Therefore the peak positions in all of the channels differ bya few (2-3) frame number. Peak fitting will consider them to beoverlapping peaks. Rejecting these bands is important for the matrixcalculation. Intensity criteria: If a peak whose maximum intensity isonly 20% of the average peak intensity in a local section, the peak isrejected for the calculation of matrix coefficients. The small peak willcause significant errors for the matrix coefficients. FIG. 5 shows therelative fluorescence candidate peak intensity plots, one for each ofthe four nucleotides. The peaks labeled with * are rejected due to thespacing criteria. Most of these rejected peaks happen in the context ofG after A. The mobility shift causes the two peaks to overlap.Step 212. Band Categorization (Clustering). If a band has passed theabove-described filtering process, the band will go to the bandcategorizing (clustering) process. The band intensity is determined froma data channel that is the sum of the intensity over all of thewavelength channels. This channel signal, in most cases, is from0^(th)-order of the grating, which has no color-dispersed power. Anothertechnique is to create this channel of the data that is the sum of theintensities over all of the channels.

FIG. 6 shows a flowchart 600 representing the characterization of theremaining peaks, which thus correspond to the remaining detected bands.In step 602, the remaining peak intensities are normalized. In step 604,the band categorization commences with the strongest band. In step 606,the normalized intensities within a spectral set are compared and onlythose which have a significant value above the noise level are retained.In step 608, the bands are clustered if the differences in correspondingnormalized coefficients are less than 5% of maximum intensity (0.05where the maximum has been normalized to 1.0). In step 610, the averageand standard deviation of the coefficients from a set of bands whichhave been clustered are calculated. Finally, in step 612, thecoefficients for the calibration matrix are calculated. The stepsdescribed above are best illustrated using examples.

Step 602—Normalizing Intensities. The following example is a set of dataextracted from FIG. 5 at a time of 53.89 min corresponding to base C.Table 1 shows the intensity values for this set of data, which isnormalized to 1.0.

TABLE 1 Raw Data, Coefficient Calculation, and Un-Comparable Data.Channel 0 1 2 3 4 5 6 7 8 9 Intensities 31 58 54 43 160 962 2538 27291691 840 Norm. Coeff 0.011 0.021 0.019 0.015 0.058 0.352 0.930 1 0.6190.307

Step 604—Band Clustering Starting with the Strongest Bands. The processof band pattern recognition starts from the strongest band, and thenmoves to the next strongest, and so on. If a band shows up in a fewchannels at a specific time as peaks, then the intensity is normalizedover all of the intensities in other channels as a matrix coefficient.There are certain advantages to choosing the band with the strongestintensity first, and then the second strongest, and so forth. Because ofinstrument noise, the coefficient calculation of the strongest bands ismore accurate than the low intensity bands. Accordingly, the effects ofthe leading and trailing portions of spurious peaks have lesser overalleffect on a stronger band, than on a weaker band.

Step 606—Intensity culling; noise effects; low intensity andcoefficients. In various implementations, the overall noise level fromall noise sources, such as shot noise, CCD reading noise, and CCD darknoise, is on the order of about 50 counts. Mathematical manipulation ofthe raw data, such as baseline subtraction and smoothing, can alsointroduce noise to the data. In some embodiments, the data intensity ischosen to be about three times (150 counts) the noise level, and so thisvalue is selected as a threshold. This criteria is consonant withconventional statistical principles. Thus, if the data intensity islower than 150 counts, it preferably is not used for band categorizing.For example, in Table 1, the data in channels 0, 1, 2 and 3 are lessthan 150 and so their coefficients 0.0114, 0.021, 0.0198, and 0.0158 arenot be used for categorizing. These coefficients are calledun-comparable coefficients, which are likely to cause calculationerrors, and so are discarded.

Step 608—Band categorizing. If the difference in the comparablecoefficients of two bands is less than 5% of maximum intensity (or 0.05units), the two bands are clustered as being in the same category. Table2 shows an example with 7 sets of coefficients, each set having beenindividually normalized. In the bands shown in Table 2, bands 1, 3, 4are in the same category, because none of their coefficients differ bymore than 0.05. However, band 1 and band 2 have coefficient differencesof more than 0.05 units and so are considered to be in differentcategories. Using the 5% rule, it is evident that bands 5 and 6 are inthe same category and band 7 forms its own category.

TABLE 2 Peak Clustering Band 1 0.015 0.016 0.014 0.011 0.043 0.305 0.8581 0.635 0.331 Band 2 0.021 0.024 0.039 0.218 0.718 1 0.700 0.425 0.2890.211 Band 3 0.012 0.012 0.018 0.017 0.055 0.322 0.879 1 0.628 0.321Band 4 0.004 0.020 0.017 0.016 0.054 0.306 0.861 1 0.627 0.334 Band 50.347 1 0.948 0.612 0.423 0.288 0.162 0.096 0.063 0.049 Band 6 0.336 10.952 0.612 0.427 0.290 0.183 0.105 0.066 0.047 Band 7 0.024 0.163 0.6651 0.756 0.436 0.292 0.199 0.118 0.075

Upon considering the data in Table 2, one may think it adequate toalways categorize bands based on the maximum normalized peak. This,however, is not always the best approach. In some cases, the channelhaving the maximum intensity can be in either of two close channels forthe same type of bands. For example, if two bands have theircoefficients of 0.9948, 1 and 1, 0.982 in, say, channels 2 and 3,respectively, one might consider the two bands to belong to differentcategories, if only a maximum intensity rule is used. However, a systemusing the 5% of the maximum intensity rule will always take these twopeaks as the same type of bands.

On occasion, a computer may automatically cluster the bands into moredye spectra than the number of dyes used in the electrophoresis. Thisresults in a fake cluster 720, as seen in FIG. 7 b. The fake clusterresults in a fake dye spectrum caused by overlapping peaks. The numberof such overlapping peaks, following the various processing and culling,is preferably small, as compared to the number of real DNA spectra. Inthe event such a fake spectrum (and the corresponding extra cluster)arises, one may increase the 5% rule to 7% to see whether the overlappedbands merge into one or more of the other clusters bands. Thecoefficients of these fake bands can be represented as a combination ofthe spectra of the high occurrence bands. If the low occurrence bandscan be written as a combination of the two high occurrence bands withtwo positive distributions, these types of flow occurrence bands arefake bands. After recognizing fake bands with two of the aboveproperties, these fake spectra are rejected from the coefficientcalculation.

Step 610—Standard deviation rejection. The average and the standarddeviation of each set of coefficients are calculated after the bandcategorizing process. If the deviations of the normalized coefficientsfor a given set are greater than 130% of the standard deviation, thecorresponding band should be rejected for the coefficient calculation.

Step 612—Coefficient calculation. After clustering, the coefficients ofthe sets within each of four clusters (one cluster for each nucleotide)can be plotted, as seen in FIG. 7 a, to verify that the clustering wasproperly performed and the desired number of clusters has resulted. Theaverage of the coefficients of each of the sets is then taken to form anR-length vector (®=10 in some embodiments), and each such R-lengthvector corresponds to one of the four columns in the coefficient matrixC.

Step 216. Color (Spectral) Deconvolution. During use, the pseudo-inverseof coefficient matrix C calculated for each separation lane is used tomap a detected set of intensities from that separation lane, onto adecision vector B, as given in Eq. 3. The position of the highest valuein decision vector B corresponds to the identity of the dye.

Applications of the teachings described above are now illustrated usingvarious examples.

Example A DNA Sequencing Analysis

Experimental conditions: capillary ID 75 um, OD 200 um, total length 80cm, effective length (from injection end to detection window) 55 cm.Separation voltage 150 v/cm (12 kV). 96 capillaries are arranged inparallel on a plane to form a capillary array.

Injection: 6 kV for 1 min. DNA sequencing sample: labeled PE BiosystemBigDye.

Excitation: all-line Ar ion laser emitting between at 450-520 nm (514.5nm and 488 nm are two strongest emission lines). Laser light is spreadover a 96-capillary array by cylindrical lenses. Detection: Nikon cameralens with focal length 85 mm and F1.4 is used to collect thefluorescence from the capillary array. The fluorescence then passthrough a longpass optical filter (cutoff 525 nm) (Optical Omaga Inc.,CT) and a transmission grating (Edmund Scientific, NY) and impinge on aCCD camera (PixelVision, Wash.). The resolution of the system is about 5nm/pixel. Every three consecutive pixels is binned and each channelrepresents the fluorescence intensity over 15 nm.

Gel and separation conditions. The gel is a 5% linear polymer gel with 7M. The DNA in FIG. 1 were separated at room temperature.

FIG. 3 shows the electropherograms of 10 wavelength channels for DNAsequencing in a time window from 42 min to 54 min. Top trace shows theblue channel at 525 nm. The next trace is the data at 540 nm and so on.The bottom channel shows the red channel at 650 nm. The traces areconstantly shifted for a better view.

FIG. 7 a shows the spectra profiles of several resulting DNA bands. Thebands are classified into four categories, each of which corresponds toone of the four bases. FIG. 7 b shows the number of bands in eachcategory, and group 720 (overlapped bands) are rejected from thecoefficient calculation. Most of the overlapped bands take place at Gimmediately after A. Since the DNA fragments ending with G move a littlefaster than those of A, the two peaks overlap during the instance of Gimmediately after A.

FIG. 5 shows the four traces that have automatically been deconvoluted.The four traces are fragments of G, A, T, C from the top to the bottom.The bands labeled with * are the bands that have been rejected.

Example B DNA Fragment Analysis

FIGS. 8, 9 and 10 show the data of DNA fragments. The gel in thisexperiment is 5% polymer gel without any urea. The temperature ofseparation was regulated at 80° C. The three types of dyes are used tolabel DNA fragments. The eight traces of the data are shown in FIG. 8,which is similar to DNA sequencing. The bands are automaticallyclassified into three types of spectra. Then the coefficients of thethree types of bands are automatically calculated to deconvolute intothree distinct traces shown in FIG. 10. Trace 3 is a standard sampleGeneScan 500 from PE Biosystem (CA). This section shows the DNA sizefrom 60 to 350 basepairs, specifically 75, 100, 139, 150, 160, 200, 250,300, 340, 350 basepairs. Since the intensity of this trace is lower thanthe other traces, the corresponding spectra profile 910 in FIG. 9 ashows more variation.

Example C CZE (Capillary Zone Electrophoresis) for Protein Separation

A similar setup has been used for capillary zone electrophoresis. Theprotein samples are injected into the individual capillary of a96-capillary array. The capillaries of ID 50 um, OD 150 um, total length35 cm and effective length of 25 cm are used for the experiment. Theseparation takes place at 150 V/cm. The borate buffer at pH 10.5 was theseparation medium. The samples are mixtures of proteins injected with avacuum (hydrodynamical injection). One standard with different emissionspectra from the proteins is added to the sample for quantitativeanalysis. The data of 6 wavelengths are collected to resolve the twounknowns as in FIG. 11 a. After the computer program picks up the bandsand then recognizes the spectra-pattern as in FIG. 11 c, the two tracesof data are shown in FIG. 11 b after the matrix deconvolution.

Example D Different Dye Sets Used in the Same Capillary Array

The teachings discussed herein have been used to automatically obtainthe calibration coefficients for different dye sets commonly used in DNAsequencing. The methodology includes peak classification, initial peakrejection, coefficient determination, refined peak rejection, and colorde-convolution.

(1) Peak classification. To automatically calibrate a single dye set, atagged DNA sample was introduced into a single capillary andelectrophoresced. Approximately 500 bases in a single electropherogramwere detected, each base giving rise to a peak within the set of 10channels. The peaks were then classified according to the channel inwhich their intensity was a maximum. First, peak positions andintensities were recorded and metrics such as average peak spacing inthe time domain and average peak intensity were also calculated. Ingeneral, when a peak shows up in one channel, a peak often shows up inthe other channels in the time domain at the same time. This is becauseeach member within a dye family causes some overlap among the 10contiguous channels. At the specific time that a peak shows up, theintensities of the peaks over the ten wavelength channels were comparedto determine in which of the 10 channels a peak exhibited maximumintensity. The channel numbers in which the maximum intensity of a peakwas found were recorded for each peak, and this was histogrammed. FIG.12 shows a histogram of the maximum intensities among the wavelengthchannels, indicating that peak maxima were most frequently detected inchannels 2, 4, 6 and 8. This corresponds to the spectral peak of thefour bases among the 10 contiguous channels. Thus, although some peakmaxima were found in all 10 channels, these four channels were dominant.

(2) Initial peak rejection. Three types of peaks were rejected prior tothe calculation of calibration coefficients. First, peaks whose maximumintensities did not fall into any of channels 2, 4, 6, 8 were rejectedand eliminated from consideration. Second, peaks which overlapped in thetime domain were also rejected. Two peaks were considered to overlap ifthe spacing between two adjacent peaks in time domain was smaller than80% of average spacing distance between peaks. Third, low intensitypeaks, defined as those peaks having a maximum peak intensity less than20% of the average peak intensity, were also rejected from furtherconsideration. After initial peak rejection, only about 300 of theoriginal 500 peaks remained as candidates for use in calculatingcalibration coefficients.

(3) Calculation of the average coefficients and their standarddeviation. The maximum intensity of the remaining 300 or so peaks wasfirst normalized to 1.0000, the normalization being done in thewavelength domain. In other words, if the maximum for a peak was inchannel 2, indicating a “G” base for a particular set of dyes, the 10coefficients for the “G” base for this particular peak were calculatedas the ratio of the intensity in each of the 10 channels to theintensity found in channel 2 for that peak. Thus, the set of calibrationcoefficients for base G is derived from those remaining 300 peaks whosemaximum intensity was found in channel 2 by normalizing each such peakin the wavelength domain and taking the averages of each of the 10 setsof coefficients. Similarly, the set of calibration coefficients for theA, T and C bases were calculated from those remaining 300 peaks whosemaximum intensities were found in channels 4, 6 and 8, respectively. The10 group coefficient averages and the 10 group standard deviations foreach of the four groupings (G, A, T and C) are then calculated forfurther processing.

(4) Additional peak rejection. If the difference between any one of the10 normalized coefficients for a peak within a particular group (G A, Tor C) and the group average for that coefficient is bigger than apredetermined times (e.g., 1.5 times) the group standard deviation forthat coefficient, that peak is rejected and not used in coefficientcalculations.

(5) Matrix formation. After the additional peak rejections have beenperformed, the average coefficients for each group are calculated toestablish the calibration matrix.

(6) Color deconvolution. Given the output from the detector, Equation 3is used in conjunction with the appropriate calibration matrix tocalculate the four base intensities. This results in color deconvolutionof the signals.

Calibration coefficient matrices were calculated for the SpectruMedixModel SCE 9610 Genetic Analysis System for each of the following dyesets: ABI BigDye terminator dye set, ABI Rhodamine terminator dye set,Amersham ET primer dye set, and Baylor Bodipy dye set. The resultingmatrices are shown in FIGS. 13 a-13 d. The data in the first column ofeach matrix represents the averaged intensity distribution over the 10channels for a base emitting the shortest wavelength. The second, thirdand fourth columns represent coefficients of the bases emittingincreasingly longer wavelengths during fluorescence, with the fourthcolumn representing the coefficients of a base that emits in the longestwavelength fluorescence.

As is known to those skilled in the art, Bodipy dyes have a narrowemission spectrum and small wavelength spacing (20 nm) between adjacentdyes. To accommodate Bodipy dyes, only two adjacent pixels, rather thanthree, were binned so as to give high spectral resolution. The newmatrix, which is based on two-pixel binning for each channel,dramatically enhances results using Bodipy dyes for DNA sequencing.

Because each lane in a multi-lane electrophoretic separation system canhave its own calibration matrix, one can use multiple dye sets at thesame time, only a single dye set being used to tag the sample in eachlane. This allows one to divide a sample into two or more moieties, tageach moiety with a different dye set, and compare the results ofperforming separation of the sample, as tagged with different dye sets.Thus, one can directly compare the performance of different dye setswithout changing instrument set-up, such as by using a different set offilters. In samples that have been separated using an array ofcapillaries, different combinations of the dye sets have been used totag samples, with each capillary having therein a sample tagged withonly one dye set.

Peak Selection and Basecalling Using 3-D Pattern Recognition

In the present teachings, various heuristic and statistical techniquesare used to select peaks whose underlying data are used to formcalibration matrices, especially in DNA sequencing applications. Analternative approach for selecting peaks to be used for coefficientcalculation is to identify solitary peaks in topographic plots oftime-frequency plots.

FIGS. 14-16 show time-frequency plots from sequencing a DNA sample, in acapillary, using the SpectruMedix SCE 9610 instrument. In FIGS. 14-16,the x-axis represents a time element, as manifested by the frame numberof the detector output. As seen in these figures, a single peak occupiesseveral frames in the time dimension, the exact number depending on therate of sample migration and the speed at which the fluorescence issampled by the detector. The y-axis represents the pixel position, whichrelates to wavelength at 5 nm/division starting at 520 nm, with a totalof 35 points, effectively serving as 35 channels, in the y dimension. Asalso seen in FIGS. 14-16, a single peak occupies more than one frame inthe wavelength dimension due to the overlapping spectrum of each dye.The contours associated with each peak, i.e., the flattened “z” axis,correspond to the intensity of that peak

FIGS. 14-16 exhibit peaks with different morphologies. Single, solitarypeaks 700 which do not overlap with other peaks are circular, orslightly oval in shape. The data corresponding to such isolated peakscan be used to create calibration coefficients. Peaks that have mergedtogether to form conjoined twin peaks 710, or multiple sets of connectedpeaks 720, preferably are rejected when calculating calibrationcoefficients. Thus, by first plotting the data in the form oftime-frequency plots, one can first identify solitary peaks and thengroup together solitary peaks corresponding to the same base (or othertagged species). Given the isolated peaks, their underlying data can beused to normalize each peak, and perform other operations necessary inthe calculation of the coefficient matrices. FIG. 14 shows an early partof the sample separation between base pairs 100-130. FIG. 15 shows amiddle part of the sample separation between base pairs 320-440, andFIG. 16 shows a terminal part of the time-frequency plots. It is notedthat the morphological features are better separated from one another inthe early and middle parts of the sample separation. This is because thecorresponding fragments are smaller, and therefore more distinct in thetime domain, as they migrate. Accordingly, when using morphology toidentify candidate peaks, one may prefer to use time-frequency plotsfrom shorter fragments, i.e., the fragments which migrate earlier on.

The plots of FIGS. 14-16 also suggest an alternative to using Equation 3to perform color deconvolution of the received channel data, inconjunction with a calibration matrix. This alternative is to directlyidentify the morphological shapes in a time-frequency plot. Thus, in thecase of FIGS. 14-16, one can perform direct basecalling without firsthaving to calculate calibration matrices. Direct basecalling may be moreaccurate when dealing with overlapped peaks because pairs of adjacentpeaks exhibit fairly consistent appearances.

FIG. 17 shows isolated examples of different morphologies. FIG. 17 ashows a single peak; FIG. 17 b shows double overlapped peaks with thesame base in which the twin peaks appear as an elongated oval in thetime domain; FIGS. 17 c and 17 d show double overlapped peaks with thedifferent bases; and FIGS. 17 e, 17 f and 17 g show three adjacentoverlapped peaks.

Identification of the solitary peaks, and direct basecalling, can eitherbe performed visually by humans, or automatically by using machine-basedimage processing or pattern recognition techniques well known to thoseskilled in the art of computer vision. Thus, in the case ofmachine-based processing, morphological filters can be used as templatesto identify the features seen in FIG. 17.

The section headings used herein are for organizational purposes onlyand are not to be construed as limiting the subject matter described inany way.

While the present teachings are described in conjunction with variousembodiments, it is not intended that the present teachings be limited tosuch embodiments. On the contrary, the present teachings encompassvarious alternatives, modifications, and equivalents, as will beappreciated by those of skill in the art.

1. A parallel electrophoresis system comprising a plurality of separation lanes, a detector and a processor connected to the detector, wherein light intensity is received at the detector from at least two different separation lanes, and processed using at least two different calibration matrices.
 2. A method for calibrating a separation system having at least one separation lane, comprising: clustering a plurality of sets of detector signals into a number n categories by using at least one clustering criterion, wherein: each set of detector signals corresponds to light intensities at each of a plurality of different wavelengths detected from each sample of a plurality of samples separated along the at least one separation lane; and the step of clustering comprises normalizing at least some detector signals of each set of detector signals by a respective normalization value to prepare a plurality of sets of normalized detector signals, each set comprising normalized detector signals; comparing corresponding normalized detector signals of different sets of normalized detector signals; and clustering normalized sets of detector signals that do not have compared corresponding normalized detector signals differing by more than the clustering criterion; and calibrating the separation system based on the clusters. 